Chris Pollett > Old Classes >
CS267

( Print View )

Student Corner:
  [Grades Sec1]

  [Submit Sec1]

  [Class Sign Up Sec1]

  [
Lecture Notes]
  [Discussion Board]

Course Info:
  [Texts & Links]
  [Topics/Outcomes]
  [Outcomes Matrix]
  [Grading]
  [HW/Quiz Info]
  [Exam Info]
  [Regrades]
  [Honesty]
  [Additional Policies]
  [Announcements]

HW Assignments:
  [Hw1]  [Hw2]  [Hw3]
  [Hw4]  [Quizzes]

Practice Exams:
  [Mid 1]  [Mid 2]  [Final]

                           












HW#4 --- last modified February 10 2019 22:00:29..

Solution set.

Due date: Dec 9

Files to be submitted:
  Hw4.zip

Purpose: To gain experience with index compression methods, to learn about divergence from randomness, to study map reduce algorithms.

Related Course Outcomes:

The main course outcomes covered by this assignment are:

LO4 -- Give an example of how a posting list might be compressed using difference lists and gamma codes or Rice codes.

LO5 -- Demonstrate with small examples how incremental index updates can be done with log merging.

LO7 -- Know at least one Map Reduce algorithm (for example to calculate page rank).

Specification: This homework consists of four parts: (a) book problems 14.6 and 14.7 which you should put in a file Problems.pdf inside your zip file, (b) implement divergence from randomness (DFR) as a scoring method for your Hw3 code. (c) adding index compression to your code from Hw3, , and (d) implementing a log merge algorithm as an extension Hw3 Part(a) is relatively self-explanatory. For Part (b), I want your program now to take dfr as a possible value for the third argument of your program from Hw2. It should calculate DFR using equations 9.51/9.52 in the book. I want you to do trec_eval tests comparing BM25 and DFR. Put your write-up of these experiments in experiments.txt. For part (c), I want you to modify the way posting lists are stored in your index, so that they now make use of one the index compression codes (gamma, delta, rice, etc) that we have studied. When you index a folder I want you to write the compressed index to disk. I want you to code an auxiliary program called index_viewer which can be run from the command line with two arguments the file name of an index that you have written together with a word to look up. This program should then pretty-print the posting list entries for that word (i.e., a string of 1's and 0's showing what the compressed posting looked like, followed by what the uncompressed data looked like). For part (d), I want you to assume that you only have enough memory to hold 10 documents at a time in memory (this is also at merge time and you can assume documetns are all roughly the same length). I want you to code your program so that it uses the disk to keep indexing according to Figure 7.7. Furthermore, I want you to sleep 5 seconds between after you merge any pair of files in this algorithm, so that I can see the actually merge process by watching the folder involved. Finally, I want you to include with your homework a readme.txt file containing all the team members names and ids, as well as a brief description about how to find and run things for each of the four parts of this assignment.

Point Breakdown

Book problems (1pt each 0 incorrect, 0.5 partial, 1 correct) 2pts
Coding guidelines, and code elegance 1pt
DFR extension to HW3 works as described2pts
trec_eval experiments1pt
Index compression code works (1pt), index_viewer is as described (1pt)2pts
Logarithmic merge code works as described2pts
Total10pts